POLI 381 Data Project – Data Quality Control

Author

Adam Cheng

1 Introduction

This project studies how economic performance influences public approval of national governments over time. The dataset has 9 variables covering 111 countries from 1990 to 2023.

Table 1: A sample of the dataset (filtered by USA).
Each column represents a variable, each row represents an observation, and each cell holds the value of the corresponding variable
country_name   country_code  year  approval_smoothed  approval_growth  gdp_pc    gdp_pc_growth  unemployment_rate  cpi_growth
United States  USA           1990  53.51480           4.7219676        44378.52  NA             5.62               5.40
United States  USA           1991  56.38072           5.3553726        43742.03  -1.434248      6.82               4.23
United States  USA           1992  45.20927           -19.8143060      44659.15  2.096669       7.51               3.03
United States  USA           1993  44.96474           -0.5408824       45286.94  1.405723       6.90               2.95
United States  USA           1994  44.03471           -2.0683562       46537.36  2.761109       6.12               2.61
United States  USA           1995  43.86478           -0.3859001       47220.96  1.468929       5.65               2.81

1.1 Country Coverage

Figure 1: Map showing the dataset’s geographical coverage. Countries covered by the dataset are shaded blue and countries with no observations are coloured white

Figure 1 shows the dataset covers a wide range of countries, enhancing the external validity of the analysis.

2 Data Quality Assessment

Variables will be individually examined in concise paragraphs due to the scale of the dataset and constraints of word count.

2.1 gdp_pc

GDP per capita in 2021 PPP USD ($) from The World Bank (2025) is a covariate accounting for countries’ base income level, calculated as $\frac{\text{Consumption} + \text{Investment} + \text{Government Spending} + \text{Net Exports}}{\text{Population}}$, with currency converted to 2021 US purchasing power parity rates.

Table 2: Tables for gdp_pc.
Subsequent tables under this section will share the same format and comments will be kept brief
(a) Data diagnosis table. Each column represents a diagnostic metric for the variable and outliers are classified based on the 1.5 × IQR rule
Variable  Type    Unique Values  Total Values  Missing Values  Missing Proportion (%)  Outliers Count  Zero Count
gdp_pc    double  3730           3730          44              1.17                    85              0
(b) Summary statistics table. Each column represents a summary statistic for the variable
Variable  Mean      Standard Deviation  Minimum  5th Percentile  25th Percentile  Median    75th Percentile  95th Percentile  Maximum
gdp_pc    23035.64  22353.66            534.81   1653.59         5936.38          14954.72  35821.23         62638.93         137821.4
(a) KDE plot of gdp_pc (a Gaussian kernel with a bandwidth of 2000 is used). The x-axis indicates values of the variable and the y-axis represents the estimated density at each value, such that the area under the curve equals 1. A rug plot is provided at the bottom, with each tick mark representing a value of the variable
(b) Quantile plot for gdp_pc. The x-axis indicates quantiles of a normal distribution and the y-axis represents quantiles of the variable. Points aligning with the red reference line suggest the variable follows a normal distribution. Red data points highlight the minimum/maximum values, labeled with their corresponding ISO3 country code, year, and value
Figure 2: Kernel density estimation (KDE) and quantile plots for gdp_pc.
Subsequent plots under this section will share the same format and comments will be kept brief

Figure 2 suggests the distribution of gdp_pc is right-skewed with considerable variation and no notable gaps in data. Table 2 (b) supports this skewness, as the mean ($23,035.64) exceeds the median ($14,954.72), reflecting the historically substantial income disparities between countries.
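The diagnostic metrics in Table 2 (a) can be illustrated with a minimal Python sketch (not the R code used for this report). The function name `diagnose`, the sample values, and the choice of quartile method are assumptions made for illustration; a different quartile convention can shift the outlier count slightly.

```python
import statistics

def diagnose(values):
    """Basic diagnostics for a list in which None marks a missing value."""
    observed = [v for v in values if v is not None]
    missing = len(values) - len(observed)

    # Quartiles for the 1.5 x IQR rule; the "inclusive" method is an assumption.
    q1, _, q3 = statistics.quantiles(observed, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    return {
        "total_values": len(observed),
        "missing_values": missing,
        "missing_proportion": round(100 * missing / len(values), 2),
        "outliers_count": sum(1 for v in observed if v < lower or v > upper),
        "zero_count": sum(1 for v in observed if v == 0),
    }

# Made-up sample: one missing value and one value far above the rest.
sample = [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 50.0, None]
result = diagnose(sample)
print(result)
```

Applied to the figures in Table 2 (a), the same proportion formula gives 100 × 44 / (3730 + 44) ≈ 1.17.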

2.2 gdp_pc_growth

Percentage change in gdp_pc from the previous year, derived in R. This is the independent variable, measuring countries’ economic performance over time.
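As a minimal sketch of this derivation (in Python rather than the R used for the report), the year-on-year percentage change can be computed within each country, leaving the first available year undefined. The function name `pct_change_by_country` and the row layout are assumptions; the three gdp_pc values are taken from Table 1.

```python
def pct_change_by_country(rows, value_key):
    """Percentage change from the previous year, computed separately per country.

    rows: list of dicts with 'country_code', 'year', and value_key.
    Returns a dict keyed by (country_code, year); None where no change can be computed.
    """
    growth = {}
    previous = {}  # last (year, value) seen for each country
    for row in sorted(rows, key=lambda r: (r["country_code"], r["year"])):
        code, year, value = row["country_code"], row["year"], row[value_key]
        prev = previous.get(code)
        if (prev is not None and value is not None and prev[1] not in (None, 0)
                and year == prev[0] + 1):
            growth[(code, year)] = 100 * (value - prev[1]) / prev[1]
        else:
            growth[(code, year)] = None  # first year, gap in years, or missing value
        previous[code] = (year, value)
    return growth

rows = [
    {"country_code": "USA", "year": 1990, "gdp_pc": 44378.52},
    {"country_code": "USA", "year": 1991, "gdp_pc": 43742.03},
    {"country_code": "USA", "year": 1992, "gdp_pc": 44659.15},
]
growth = pct_change_by_country(rows, "gdp_pc")
print(growth)
```

Because the first year of each country has no predecessor, a derived growth variable has at least one more missing value per country than its source variable.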

Table 3: Tables for gdp_pc_growth
(a) Data diagnosis table
Variable       Type    Unique Values  Total Values  Missing Values  Missing Proportion (%)  Outliers Count  Zero Count
gdp_pc_growth  double  3619           3619          155             4.11                    281             0
(b) Summary statistics table
Variable       Mean  Standard Deviation  Minimum  5th Percentile  25th Percentile  Median  75th Percentile  95th Percentile  Maximum
gdp_pc_growth  1.96  5.53                -64.42   -6.15           0.28             2.19    4.34             8.49             90.83
(a) KDE plot for gdp_pc_growth. Bandwidth is set to 0.5
(b) Quantile plot for gdp_pc_growth
Figure 3: KDE and quantile plots for gdp_pc_growth

Figure 3 suggests gdp_pc_growth has a roughly symmetric (slightly left-skewed) distribution with no notable gaps in data, numerous outliers (281, Table 3 (a)), and considerable variation that reflects positive and negative economic shocks useful to the analysis.
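The KDE plots in this section use a Gaussian kernel with a fixed bandwidth. A minimal Python sketch of that estimator is given below (not the R code behind the figures); the function name `gaussian_kde` and the sample values are assumptions for illustration. The final lines check numerically that the area under the estimated curve is close to 1.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density at x using a Gaussian kernel."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return density

# Made-up growth values; bandwidth 0.5 matches the setting reported for Figure 3 (a).
growth_sample = [-6.2, 0.3, 1.9, 2.2, 2.4, 4.3, 8.5]
density = gaussian_kde(growth_sample, bandwidth=0.5)

# Riemann sum over a grid wide enough to cover every kernel.
step = 0.01
grid = [-15 + i * step for i in range(int(40 / step) + 1)]
area = sum(density(x) for x in grid) * step
print(round(area, 4))
```

A smaller bandwidth produces a spikier curve that follows individual observations, while a larger one smooths over local structure.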

2.3 approval_smoothed

Approval of the national government (% of survey respondents), smoothed via exponential smoothing, from Carlin et al. (2023) is the dependent variable measuring public approval. approval_smoothed is constructed from survey data collected from numerous sources, which ask respondents in countries with competitive elections whether they approve of their executives.

Table 4: Tables for approval_smoothed
(a) Data diagnosis table
Variable           Type    Unique Values  Total Values  Missing Values  Missing Proportion (%)  Outliers Count  Zero Count
approval_smoothed  double  2389           2412          1362            36.09                   22              0
(b) Summary statistics table
Variable           Mean   Standard Deviation  Minimum  5th Percentile  25th Percentile  Median  75th Percentile  95th Percentile  Maximum
approval_smoothed  47.32  14.54               3.94     25.59           36.96            46.47   56.51            73.34            95.65
(a) KDE plot for approval_smoothed. Bandwidth is set to 2.5
(b) Quantile plot for approval_smoothed
Figure 4: KDE and quantile plots for approval_smoothed

Figure 4 suggests the distribution of approval_smoothed is roughly symmetric (slightly right-skewed) with no notable gaps in data. However, Table 4 (a) highlights rather extreme outliers and a large proportion of missing data (36.09%); this will be addressed later. Nonetheless, the remaining 2,412 observed values are sufficiently numerous and representative for the analysis.

2.4 approval_growth

Percentage change in approval_smoothed from the previous year, derived in R. This covariate captures relative changes in approval_smoothed, allowing comparisons with gdp_pc_growth.

Table 5: Tables for approval_growth
(a) Data diagnosis table
Variable         Type    Unique Values  Total Values  Missing Values  Missing Proportion (%)  Outliers Count  Zero Count
approval_growth  double  2314           2335          1439            38.13                   192             22
(b) Summary statistics table
Variable         Mean  Standard Deviation  Minimum  5th Percentile  25th Percentile  Median  75th Percentile  95th Percentile  Maximum
approval_growth  2.03  24.58               -71.63   -23.47          -8.43            -0.51   7.47             34.02            455.31
(a) KDE plot for approval_growth. Bandwidth is set to 2
(b) Quantile plot for approval_growth
Figure 5: KDE and quantile plots for approval_growth

Figure 5 suggests the approval_growth distribution is right-skewed with no notable gaps in data and many extreme outliers requiring further examination. Table 5 (a) also reports 22 zero values, which are likely attributable to the exponential smoothing of approval_smoothed rather than genuine stability in approval.

2.5 cpi_growth

Percentage change in the consumer price index (CPI) from the previous year, from International Monetary Fund (2025). This covariate accounts for inflation. CPI measures the price level of a basket of goods and services that typical households consume. The basket and its price are determined by household surveys and supplier data, respectively.

Table 6: Tables for cpi_growth
(a) Data diagnosis table
Variable    Type    Unique Values  Total Values  Missing Values  Missing Proportion (%)  Outliers Count  Zero Count
cpi_growth  double  1729           3520          254             6.73                    361             1
(b) Summary statistics table
Variable    Mean  Standard Deviation  Minimum  5th Percentile  25th Percentile  Median  75th Percentile  95th Percentile  Maximum
cpi_growth  31.3  452.84              -16.12   -0.2            1.96             4.13    8.88             36.82            23773.13
(a) KDE plot for cpi_growth. R’s default bandwidth is applied
(b) Quantile plot for cpi_growth. Minimum value label is hidden due to data point obstruction
Figure 6: KDE and quantile plots for cpi_growth

Figure 6 (b) indicates an extreme outlier (Congo during its civil war in 1994) that flattens the rest of the distribution. A logarithmic transformation is therefore required.

(a) KDE plot for cpi_growth_log10. Bandwidth is set to 0.05
(b) Quantile plot for cpi_growth_log10
Figure 7: KDE and quantile plots for $\log_{10}(x + 17.12)$ of cpi_growth (cpi_growth_log10). A constant of $17.12$ was added to cpi_growth to shift the minimum value ($-16.12$) to $1$, ensuring only positive values are used for the log transformation

Following the transformation in Figure 7, the distribution is right-skewed with no notable gaps in data. Figure 7 (b) also reveals many outliers with greater concentration in the right tail, reflecting the historical tendencies for inflationary crises.
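The shifted logarithmic transformation behind Figure 7 can be sketched in Python as follows (not the R code used for the report). The function name `shifted_log10` is an assumption; the sample values are the minimum, 5th percentile, median, 95th percentile, and maximum of cpi_growth from Table 6 (b).

```python
import math

def shifted_log10(values, target_min=1.0):
    """Shift values so the minimum equals target_min, then take log10.

    None marks a missing value and is passed through unchanged.
    Returns the shift constant and the transformed values.
    """
    observed = [v for v in values if v is not None]
    shift = target_min - min(observed)
    transformed = [None if v is None else math.log10(v + shift) for v in values]
    return shift, transformed

cpi_sample = [-16.12, -0.2, 4.13, 36.82, 23773.13]
shift, transformed = shifted_log10(cpi_sample)
print(round(shift, 2), [round(t, 3) for t in transformed])
```

With the minimum mapped to 1, the smallest transformed value is log10(1) = 0, and the extreme maximum is compressed to roughly 4.4 on the log scale.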

2.6 unemployment_rate

Unemployment rate (% of labour force) from International Monetary Fund (2025). This covariate accounts for unemployment. The variable is measured using labour force surveys, where respondents without work who are available for and actively seeking work are considered unemployed.

Table 7: Tables for unemployment_rate
(a) Data diagnosis table
Variable           Type    Unique Values  Total Values  Missing Values  Missing Proportion (%)  Outliers Count  Zero Count
unemployment_rate  double  1388           2755          1019            27                      156             0
(b) Summary statistics table
Variable           Mean  Standard Deviation  Minimum  5th Percentile  25th Percentile  Median  75th Percentile  95th Percentile  Maximum
unemployment_rate  8.29  5.77                0.04     2.08            4.39             6.84    10.31            19.71            38.8
(a) KDE plot for unemployment_rate. Bandwidth is set to 0.7
(b) Quantile plot for unemployment_rate. Minimum value label is hidden due to data point obstruction
Figure 8: KDE and quantile plots for unemployment_rate

Figure 8 suggests the unemployment_rate distribution is right-skewed with no notable gaps in data, many outliers, and a heavier right tail, reflecting numerous historical episodes of high unemployment. Table 7 (a) reports that 27% of values are missing; this proportion fluctuates over time, as highlighted by Figure 9, likely due to irregular survey frequencies.

2.7 Visualizing Missingness

Figure 9: Proportion of missing data for each variable (y-axis) over time (x-axis). Derived variables are omitted because they share the same missingness patterns as their source variables

Figure 9 suggests approval_smoothed has the most missing data over time, which is explored further in Figure 10. Missingness is smaller for the other variables, although unemployment_rate is still missing 27% of its values (Table 7 (a)).
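The quantity plotted in Figure 9 is, for each year, the share of country rows in which a variable is missing. A minimal Python sketch of that calculation follows (not the R code used for the report); the function name `missing_share_by_year`, the country codes, and the values are made up for illustration.

```python
from collections import defaultdict

def missing_share_by_year(rows, variable):
    """Proportion of rows per year in which `variable` is missing (None)."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        total[row["year"]] += 1
        if row.get(variable) is None:
            missing[row["year"]] += 1
    return {year: missing[year] / total[year] for year in sorted(total)}

# Hypothetical panel with two countries and two years.
rows = [
    {"country_code": "AAA", "year": 1990, "approval_smoothed": None},
    {"country_code": "BBB", "year": 1990, "approval_smoothed": 41.2},
    {"country_code": "AAA", "year": 1991, "approval_smoothed": 43.0},
    {"country_code": "BBB", "year": 1991, "approval_smoothed": 44.8},
]
share = missing_share_by_year(rows, "approval_smoothed")
print(share)
```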

Figure 10: Heatmap of missing approval_smoothed values by country (y-axis) and year (x-axis). Blue squares indicate missing values and white squares indicate observed values. Countries are grouped by quartiles of their mean gdp_pc from 1990 to 2023

Figure 10 suggests missing approval_smoothed data correlates with countries’ gdp_pc: higher-income countries generally exhibit less missing data, likely due to greater resource capacity for data collection. However, missing data still declined for all countries over time, suggesting that other factors, such as technological advancements, reduced the cost of survey administration everywhere. Further factors, like warfare, can also affect data availability. Missing data in recent years may reflect reporting delays, which vary by survey source.

3 Independent and Dependent Variable Relationship Assessment

3.1 Theoretical Conjecture

Denoting Variables:

  • Independent variable ($X$): % change of GDP per capita in 2021 PPP USD from the previous year (gdp_pc_growth).
  • Dependent variable ($Y$): Approval of national government smoothed via exponential smoothing (approval_smoothed).

Relationship Rationale:

  • $X$ is often assumed to be positively correlated with $Y$ because economic performance is a core responsibility of the government that impacts the living standards, happiness, and even survival of the individuals whose opinions determine public approval of their government. Furthermore, individuals often associate their financial hardships with poor governance regardless of whether the government is directly responsible.

Necessary Considerations:

  • Non-linearity: The relationship between $X$ and $Y$ may not be strictly linear because it can reach a threshold, similar to a concave utility function, beyond which further economic development produces diminishing returns of approval as individuals shift their attention to other concerns for which the government is responsible. Potential solutions include:

    • Non-linear regression (e.g. LOWESS): Visualizes non-linear patterns on scatter plots.
    • Variable transformation (e.g. logarithmic, exponential): Reduces non-linearity and simplifies relationship assessment.
  • Endogeneity: Occurs when the relationship between $X$ and $Y$ is confounded by unobserved variables that influence $Y$ and are correlated with $X$. Omitting these variables from the analysis can result in misleading conclusions about the bivariate relationship. To address this omitted variable bias, I included covariates in the dataset to control for alternative explanations.

  • Confounds: Changes in $Y$ may not be solely due to $X$; confounders such as cpi_growth or unemployment_rate may also influence $Y$, or both $X$ and $Y$.

    • Consider gdp_pc (more confounds explored in stage 6): The correlation between $X$ and $Y$ may be weaker for high-gdp_pc countries because economic performance matters less to individuals there than to those in low-gdp_pc countries. To address this, I will
      1. Group by country, then calculate the mean gdp_pc,
      2. Stratify countries into intervals of mean gdp_pc (e.g. quartiles),
      3. Run regressions for $X$ and $Y$ in each stratum and analyze changes in correlation using necessary metrics and visualizations.
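The three steps above can be sketched in Python as follows (the report itself uses R, and the eventual analysis may differ). The function names `fit_line` and `stratified_fit`, the country codes, and all data values are made up for illustration. The sketch pools country-years within each stratum and fits a simple least-squares line, which is an assumption rather than the model planned for stage 6.

```python
import math
import statistics
from collections import defaultdict

def fit_line(xs, ys):
    """Ordinary least squares slope and Pearson correlation for paired values."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return {"n": len(xs), "slope": sxy / sxx, "r": sxy / math.sqrt(sxx * syy)}

def stratified_fit(rows, n_strata=4):
    # Step 1: mean gdp_pc per country (missing values skipped).
    gdp = defaultdict(list)
    for row in rows:
        if row["gdp_pc"] is not None:
            gdp[row["country_code"]].append(row["gdp_pc"])
    mean_gdp = {code: statistics.fmean(vals) for code, vals in gdp.items()}

    # Step 2: assign each country to a stratum using quantile cut points.
    cuts = statistics.quantiles(list(mean_gdp.values()), n=n_strata, method="inclusive")
    stratum = {code: sum(m > cut for cut in cuts) for code, m in mean_gdp.items()}

    # Step 3: pool country-years within each stratum and fit Y on X.
    pairs = defaultdict(list)
    for row in rows:
        x, y = row["gdp_pc_growth"], row["approval_smoothed"]
        if x is not None and y is not None and row["country_code"] in stratum:
            pairs[stratum[row["country_code"]]].append((x, y))
    return {s: fit_line(*zip(*xy)) for s, xy in sorted(pairs.items())}

# Made-up data: approval responds more strongly to growth in the low-income pair.
rows = []
for code, gdp_pc, intercept, slope in [("AAA", 1000, 40, 2.0), ("BBB", 2000, 40, 2.0),
                                       ("CCC", 30000, 50, 0.5), ("DDD", 40000, 50, 0.5)]:
    for year, x in zip(range(2000, 2004), [-2.0, 0.0, 1.0, 3.0]):
        rows.append({"country_code": code, "year": year, "gdp_pc": gdp_pc,
                     "gdp_pc_growth": x, "approval_smoothed": intercept + slope * x})

results = stratified_fit(rows, n_strata=2)
print(results)
```

Comparing the slope and correlation across strata shows whether the relationship weakens as mean gdp_pc rises. Pooling ignores country-level differences within a stratum, so this is a descriptive check only.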

3.2 Graphical Examination

This section examines the distributions of $X$ and $Y$, and the relationship between them, using visualizations.

Figure 11: KDE plot of gdp_pc_growth and approval_smoothed z-scores (ensures a comparable scale) using a bandwidth of 0.15. Rug plots color-coded and separated by variable are added at the bottom, with tick marks representing values

Figure 11 indicates both variables share similar distribution shapes, i.e., unimodal and roughly symmetric. While $Y$ appears more spread out overall and $X$ is more concentrated at its centre, the rug plots reveal that $X$ exhibits a wider range with more extreme outliers on both ends.

(a) Q-Q plot for gdp_pc_growth and approval_smoothed with default axes limits
(b) Q-Q plot for gdp_pc_growth and approval_smoothed with axes limits set to (-6, 6) for clearer visual assessment of the distribution
Figure 12: Quantile-Quantile (Q-Q) plots comparing quantiles of gdp_pc_growth (x-axis) and approval_smoothed (y-axis) z-scores. Standardization is used to ensure comparability and easier interpretation of the 45-degree line

Figure 12 reveals that while the centres of both distributions are similar (roughly following the red reference line), there are substantial deviations in the tails. Specifically, an S-shaped pattern is observed with noticeable “plateaus” at the extremes, where $Y$ quantiles increase more slowly than $X$ quantiles. This suggests the tails of $Y$ are lighter than those of $X$: extreme $Y$ values cluster within a narrow band and so occur more consistently, whereas $X$ quantiles span a wider range, indicating more extreme values. The consistency of extreme $Y$ values may suggest measurement validity issues.
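A two-sample Q-Q comparison of this kind pairs matching quantiles of the standardized variables. The following Python sketch illustrates the idea (it is not the R code behind Figure 12); the function names, the number of quantiles, and the sample values are assumptions. In the example, one sample is a linear rescaling of the other, so every point falls on the 45-degree line.

```python
import statistics

def z_scores(values):
    """Standardize values to mean 0 and (sample) standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def qq_pairs(x, y, n_quantiles=99):
    """Pair the same quantiles of standardized x (horizontal) and y (vertical)."""
    qx = statistics.quantiles(z_scores(x), n=n_quantiles + 1, method="inclusive")
    qy = statistics.quantiles(z_scores(y), n=n_quantiles + 1, method="inclusive")
    return list(zip(qx, qy))

# Made-up samples: y is a linear rescaling of x, so the two shapes are identical.
x = [float(i ** 2) for i in range(1, 21)]
y = [3.0 * v + 10.0 for v in x]
pairs = qq_pairs(x, y)
max_gap = max(abs(a - b) for a, b in pairs)
print(len(pairs), max_gap)
```

Points that flatten below the line at the upper end and above it at the lower end indicate that the vertical-axis variable has less extreme standardized values than the horizontal-axis variable.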

Since $Y$ relies on survey data, concerns about measurement are reasonable. Specifically, the consistency of extreme cases can stem from non-random selection of survey respondents. While all survey sources used by Carlin et al. (2023) are reputable (e.g. Gallup World Poll, Eurobarometer), they utilize voluntary surveys that naturally invite biases. These include voluntary response bias, where individuals with stronger opinions respond more often (leading to extreme cases), and non-response bias, where individuals who do not respond, and who often hold different beliefs, lack representation in the sample.

Despite these prevalent biases, Carlin et al. (2023) draw from a wide range of survey sources, offering a larger sample and reducing sampling variability, which helps to mitigate these biases. Winsorization could further reduce the effects of biases in $Y$ and extreme outliers in $X$ by replacing extreme values with percentile values (e.g. the 5th and 95th percentiles) to constrain the data within a more realistic range. Figure 13 shows how Figure 12 is altered after a 90% winsorization of $X$ and $Y$.

Figure 13: Q-Q plot of gdp_pc_growth and approval_smoothed z-scores after 90% winsorization. Axes limits are set to (-2.5, 2.5) with no data points beyond the limit

Winsorization effectively “trims” the tails that initially produced the S pattern and emphasizes the middle quantiles that follow the reference line. Although winsorization can reduce the impact of outliers, it comes with the trade-off of suppressing potentially meaningful extreme values for the sake of simplicity.
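A 90% winsorization of the kind described here caps values below the 5th percentile and above the 95th percentile at those percentiles. A minimal Python sketch follows (not the R code used for Figure 13); the function name `winsorize`, the percentile method, and the sample values are assumptions for illustration.

```python
import statistics

def winsorize(values, lower_pct=5, upper_pct=95):
    """Replace values outside the given percentiles with the percentile values."""
    # quantiles(n=100) returns the 1st to 99th percentiles; index i - 1 is the i-th.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]

# Made-up sample: the integers 1 to 101, so the 5th and 95th percentiles are 6 and 96.
values = list(range(1, 102))
capped = winsorize(values)
print(min(capped), max(capped), len(capped))
```

Unlike trimming, the number of observations is unchanged: extreme values are pulled in to the cut points rather than dropped.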

Figure 14: Line plot for z-scores of mean gdp_pc_growth and approval_smoothed for all countries across time. Standardization ensures variables are on comparable scales and fluctuations in values can be easily observed

Figure 14 indicates that $X$ and $Y$ fluctuate over time, with extreme shifts reflecting major global events (e.g. COVID). $Y$ appears to follow the trend of $X$ but with a lag of approximately 3 to 4 years. The observed lag may arise from irregular survey administration frequencies creating temporal gaps between shocks to $X$ and the recording of $Y$, distorting the immediate effects between the variables. Delays in economic shocks affecting public perception may also contribute to the lag: while $X$ quantifies economic changes immediately, $Y$ responds on a different timescale because respondents vary in their tolerance of and optimism about the government, often changing their attitudes only after observing how their government handles economic shocks.

The lagged relationship will be explored in stage 6 by including $X$ values lagged by $t$ years when running regression models (e.g. distributed lag and autoregressive models).

Figure 15: Line plots for z-scores of mean gdp_pc_growth and approval_smoothed for all countries across time. Plots are separated by lagged variants of gdp_pc_growth

Figure 15 indicates $X_{t-1}$, $X_{t-2}$, $X_{t-3}$, and $X_{t-4}$ all overlap with $Y_t$ better than $X_t$ does, indicating they may better explain $Y_t$ and should be explored further as independent variables when running regressions.
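One simple way to quantify the overlap seen in Figure 15 is to correlate $Y_t$ with $X_{t-k}$ for each candidate lag $k$. The Python sketch below illustrates this on yearly means (it is not the R code used for the report, nor the regression models planned for stage 6). The function names and the series are made up, with $Y$ constructed to follow $X$ after exactly 3 years.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient for paired values."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

def lagged_correlation(x_by_year, y_by_year, lag):
    """Correlate Y_t with X_{t-lag} over the years where both are available."""
    pairs = [(x_by_year[year - lag], y)
             for year, y in sorted(y_by_year.items())
             if (year - lag) in x_by_year]
    xs, ys = zip(*pairs)
    return pearson(xs, ys)

# Made-up yearly means: Y is built to follow X with a 3-year delay.
x_values = [1.0, 3.0, -2.0, 0.5, 4.0, -1.0, 2.5, 0.0, 3.5, -3.0, 1.5, 2.0]
x_by_year = {2000 + i: x for i, x in enumerate(x_values)}
y_by_year = {year + 3: 45.0 + 2.0 * x for year, x in x_by_year.items()}

correlations = {lag: lagged_correlation(x_by_year, y_by_year, lag) for lag in range(5)}
best_lag = max(correlations, key=correlations.get)
print(best_lag, {lag: round(r, 3) for lag, r in correlations.items()})
```

With real data the correlations would rarely single out one lag so cleanly, which is why several lags are worth carrying into the regressions.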

4 Conclusion

While current evidence suggests a weak but evident correlation between $X$ and $Y$, accounting for non-linearity and controlling for confounds may reveal stronger relationships. However, it is vital to first address underlying issues with $Y$, such as lag and non-random sampling biases, to ensure the analysis avoids the perils of “garbage in, garbage out”.

References

Carlin, R. E., Hartlyn, J., Hellwig, T., Love, G. J., Martínez-Gallardo, C., Singer, M. M., … Sert, H. (2023). Executive Approval Database 3.0. Retrieved from https://executiveapproval.org/
International Monetary Fund. (2025). International Financial Statistics (IFS). Retrieved from https://data.imf.org/?sk=4c514d48-b6ba-49ed-8ab9-52b0c1a0179b
The World Bank. (2025). World Development Indicators. Retrieved from https://datacatalog.worldbank.org/search/dataset/0037712/World-Development-Indicators